[WIP] Enable generation of AOT compiled artifacts for llama2 on inf2 example #2733
Conversation
Codecov Report

@@           Coverage Diff           @@
##           master    #2733   +/-   ##
=======================================
  Coverage   72.44%   72.44%
=======================================
  Files          85       85
  Lines        3963     3963
  Branches       58       58
=======================================
  Hits         2871     2871
  Misses       1088     1088
  Partials        4        4
a few minor nits
@@ -78,7 +78,7 @@ huggingface-cli login

Run the `inf2_save_split_checkpoints.py` script
```bash
python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
```
Some docs feel missing, so presumably neuron is a JIT compiler and you're warming up a cache? Or is it an AOT compiler and you are saving the compiled artifacts, in which case it's not really a cache but a serialized compiled model?
Here we actually have both. The `neuronx-cc` JIT cache is used if the compilation artifacts are present there; if not, the Neuron persistent cache is checked for the compiled artifacts. The contents of the Neuron persistent cache are what we are generating here, to speed up the first model load. More documentation is available here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html
Will include these details in the README as well.
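To make that flow concrete, here is a minimal sketch of consuming a pre-generated persistent cache; the environment variables mirror the ones this PR sets, while the directory path is an assumption:

```python
# Minimal sketch (assumed path; env vars mirror the ones this PR sets). With
# these exported before the model is loaded, neuronx-cc first checks its JIT
# cache and then the persistent cache directory, so a pre-populated directory
# lets the first model load skip recompilation.
import os

os.environ["NEURONX_CACHE"] = "on"                # enable persistent caching
os.environ["NEURONX_DUMP_TO"] = "./neuron_cache"  # pre-generated AOT artifacts (assumed path)
```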
@@ -78,7 +78,7 @@ huggingface-cli login

Run the `inf2_save_split_checkpoints.py` script
```bash
-python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
+python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split' generate_neuron_cache --neuron_cache_dir './neuron_cache' --batch_size 4 --amp 'bf16' --tp_degree 6
```
nit: can you link to some official docs describing what the tp degree means?
This section of the Neuron documentation has a description of what `tp_degree` means:
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html?highlight=tp_degree#model-trace
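As a hedged illustration (not part of this PR; the call below follows typical transformers-neuronx usage and the exact signature should be confirmed against the linked docs), `tp_degree` is the tensor-parallelism degree, i.e. how many NeuronCores the model's weights are sharded across:

```python
# Illustrative only (not part of this PR). tp_degree sets the tensor-parallel
# degree, i.e. the number of NeuronCores the model weights are sharded across.
# Values match the example command; signature follows typical
# transformers-neuronx usage and should be checked against the docs.
from transformers_neuronx.llama.model import LlamaForSampling

model = LlamaForSampling.from_pretrained(
    "./llama-2-13b-split",   # split checkpoint produced by the script above
    batch_size=4,
    tp_degree=6,             # shard weights across 6 NeuronCores
    amp="bf16",
)
model.to_neuron()            # compiles (or loads cached) Neuron artifacts
```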
@@ -40,6 +41,12 @@ def opt_amp_callback(model: OPTForCausalLM, dtype: torch.dtype) -> None:
    default="./model-splits",
    help="Output directory for downloaded model files",
)
subparsers = parser.add_subparsers(dest="subparser")
parser_neuron_cache = subparsers.add_parser("generate_neuron_cache")
these don't feel like they should be required?
I've included `generate_neuron_cache` as an optional step when running `inf2_save_split_checkpoints`, so that arguments such as `--neuron_cache_dir`, `--batch_size`, etc. are required only if `generate_neuron_cache` is specified. So, if we need only the model checkpoint, we can still run `python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'` as before; a condensed sketch of this argparse layout is shown below.
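The following sketch paraphrases the PR's argparse structure to show why the cache flags are only required when the optional subcommand is given; argument names mirror the PR, the invocations are just examples:

```python
# Condensed sketch (paraphrasing the PR): cache-related flags live on the
# "generate_neuron_cache" subparser, so they are only required when that
# subcommand appears on the command line.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", type=str, required=True)
parser.add_argument("--save_path", type=str, default="./model-splits")

subparsers = parser.add_subparsers(dest="subparser")  # subcommand is optional by default
cache = subparsers.add_parser("generate_neuron_cache")
cache.add_argument("--neuron_cache_dir", type=str, required=True)
cache.add_argument("--tp_degree", type=int, required=True)

# Checkpoint-only invocation parses fine without any cache flags:
args = parser.parse_args(["--model_name", "meta-llama/Llama-2-13b-hf"])
assert args.subparser is None

# The cache flags only become required once the subcommand is used:
args = parser.parse_args(
    ["--model_name", "meta-llama/Llama-2-13b-hf",
     "generate_neuron_cache", "--neuron_cache_dir", "./neuron_cache", "--tp_degree", "6"]
)
assert args.subparser == "generate_neuron_cache"
```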
parser_neuron_cache.add_argument("--neuron_cache_dir", type=str, required=True) | ||
parser_neuron_cache.add_argument("--batch_size", type=int, required=True) | ||
parser_neuron_cache.add_argument("--amp", type=str, required=True) | ||
parser_neuron_cache.add_argument("--tp_degree", type=int, required=True) |
tp_degree and neuron_cache_dir could use some help statements as well
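One possible shape for those help statements (a suggestion only, not yet part of the PR):

```python
# Suggested help text; wording is a proposal and not part of the PR yet.
import argparse

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(dest="subparser")
parser_neuron_cache = subparsers.add_parser("generate_neuron_cache")

parser_neuron_cache.add_argument(
    "--neuron_cache_dir", type=str, required=True,
    help="Directory where the AOT-compiled Neuron artifacts (persistent cache) are written",
)
parser_neuron_cache.add_argument(
    "--tp_degree", type=int, required=True,
    help="Tensor parallelism degree: number of NeuronCores the model is sharded across",
)
```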
os.environ["NEURONX_CACHE"] = "on" | ||
os.environ["NEURONX_DUMP_TO"] = create_directory_if_not_exists( | ||
args.neuron_cache_dir | ||
) |
According to Mike's update, NEURON_COMPILE_CACHE_URL is the official setting; see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html?highlight=NEURON_COMPILE_CACHE_URL#neuron-persistent-cache
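For reference, per the linked caching doc the newer setting would look roughly like this (the local path is illustrative):

```python
# Sketch of the documented setting (path is illustrative). On SDK versions
# that support it, this points the Neuron persistent cache at a chosen
# location and replaces the NEURONX_CACHE / NEURONX_DUMP_TO pair used in this PR.
import os

os.environ["NEURON_COMPILE_CACHE_URL"] = "./neuron_cache"
```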
I tested this and it seems to work with the latest SDK versions (2.14.*) but not with prior versions (2.12.* and lower). Since this example is based on Neuron SDK version 2.12, I believe we can retain `NEURONX_DUMP_TO`.
Also tested loading llama2 using `LlamaForSampling` with Neuron SDK 2.14, and it fails with the following error:
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/cc7c41fb-c8bf-4adf-b91f-6bb8bf5d7c04/model.MODULE_14cd248f5b2ed6a11af6+56a4bda8.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/cc7c41fb-c8bf-4adf-b91f-6bb8bf5d7c04/model.MODULE_14cd248f5b2ed6a11af6+56a4bda8.neff', '--model-type=transformer-inference', '--model-type=transformer', '--verbose=35']: 2023-10-25T17:36:47Z Too many instructions after unroll for function sg0000 !
So, will keep the example on SDK 2.12 for now.
The root cause of this error is a bug in the 2.14 compiler. NEURON_COMPILE_CACHE_URL is still recommended by the inf2 team, since the old way generates too much debug log. Please keep this PR open until the bug fix is verified with the Neuron SDK 2.15 compiler.
This PR will replace the inf2 streamer. Closing this PR.
Description
Add support for generating model compilation artifacts ahead of time (AOT) for the llama2 on inf2 example.
Type of change
Feature/Issue validation/testing